[SPARK-13070][SQL] Better error message when Parquet schema merging fails #10972

liancheng · 2016-01-29T02:00:07Z

Now we also report the path and schema of the file that caused the failure.

rxin · 2016-01-29T02:08:03Z

...core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala

how about we just use a foreach on footers and then combine all of these in a single pass? Seems simpler.

i.e. something like

    var mergedSchema = StructType(Nil)
    footers.foreach { footer =>
      val schema = ParquetRelation.readSchemaFromFooter(footer, converter)
      try {
        mergedSchema = mergedSchema.merge(schema)
      } catch { ... }
    }
    mergedSchema

SparkQA · 2016-01-29T03:06:57Z

Test build #50331 has finished for PR 10972 at commit bfe4987.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-01-29T03:07:04Z

...core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala

Can we get the schema from the first footer and then go through this loop for the remaining footers? Because you merge the first schema with an empty schema, I think all fields in the merged schema will be marked optional in their metadata, so filter push-down will not work as expected.

Yeah, you're right, filter push-down can be affected due to #9940 (which I just merged today). Thanks for pointing this out!
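The concern about seeding the merge with an empty schema can be illustrated with a toy model. This is only a sketch: `Field`, `Schema`, the `optional` flag, and both helper functions are hypothetical stand-ins, not Spark's `StructType` API, and the merge rule is assumed from the description in this thread (a field seen on only one side gets marked optional).

```scala
// Toy stand-ins for illustration only; these are NOT Spark's StructField/StructType.
final case class Field(name: String, dataType: String, optional: Boolean)

final case class Schema(fields: Vector[Field]) {
  // Assumed merge rule: a field present on only one side is marked optional.
  def merge(that: Schema): Schema = {
    val right = that.fields.map(f => f.name -> f).toMap
    val leftNames = fields.map(_.name).toSet
    val mergedLeft = fields.map { l =>
      right.get(l.name) match {
        case Some(r) if r.dataType != l.dataType =>
          throw new IllegalArgumentException(
            s"Cannot merge ${l.dataType} with ${r.dataType} for field ${l.name}")
        case Some(r) => l.copy(optional = l.optional || r.optional)
        case None    => l.copy(optional = true)
      }
    }
    val onlyRight =
      that.fields.filterNot(f => leftNames(f.name)).map(_.copy(optional = true))
    Schema(mergedLeft ++ onlyRight)
  }
}

object MergeSeedSketch {
  // Seeded with an empty schema: on the first merge every field is "new",
  // so every field ends up marked optional.
  def mergeFromEmpty(schemas: Seq[Schema]): Schema =
    schemas.foldLeft(Schema(Vector.empty))(_ merge _)

  // Seeded with the first schema: fields shared by all inputs keep their flag.
  def mergeFromFirst(schemas: Seq[Schema]): Option[Schema] =
    schemas.reduceOption(_ merge _)

  def main(args: Array[String]): Unit = {
    val a = Schema(Vector(Field("id", "long", optional = false)))
    val b = Schema(Vector(Field("id", "long", optional = false)))
    println(mergeFromEmpty(Seq(a, b)).fields.head.optional)        // prints true
    println(mergeFromFirst(Seq(a, b)).map(_.fields.head.optional)) // prints Some(false)
  }
}
```

In this model, starting from the first schema keeps `id` non-optional when every file contains it, which is the property filter push-down would rely on.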

SparkQA · 2016-01-29T03:20:03Z

Test build #50327 has finished for PR 10972 at commit 1ce61a4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-29T05:11:19Z

...core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala

can we catch something tighter here? This is too broad.

also i'd prefer to make this more concise, e.g.

    throw new SparkException(
      s"Failed merging schema of file ${footer.getFile}: ${schema.treeString}", cause)

I don't know why these functions are super long.
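A minimal sketch of the two suggestions in this comment (catch a narrower exception, keep the rethrow concise), written without Spark so it runs on its own. `SchemaMergeException`, `FileMergeException`, `merge`, and `mergeFile` are hypothetical names; which exception type the real merge throws is an assumption left open here.

```scala
// Hypothetical exception types; the real code in this PR uses SparkException.
final class SchemaMergeException(message: String) extends RuntimeException(message)
final class FileMergeException(message: String, cause: Throwable)
  extends RuntimeException(message, cause)

object TightCatchSketch {
  // Toy merge over flat "field name -> type name" maps.
  def merge(left: Map[String, String], right: Map[String, String]): Map[String, String] =
    right.foldLeft(left) { case (acc, (name, tpe)) =>
      acc.get(name) match {
        case Some(existing) if existing != tpe =>
          throw new SchemaMergeException(s"Cannot merge $existing with $tpe for field $name")
        case _ => acc + (name -> tpe)
      }
    }

  def mergeFile(
      merged: Map[String, String],
      path: String,
      schema: Map[String, String]): Map[String, String] =
    try merge(merged, schema)
    catch {
      // Catch only the merge failure so unrelated errors propagate unchanged.
      case cause: SchemaMergeException =>
        throw new FileMergeException(s"Failed merging schema of file $path: $schema", cause)
    }

  def main(args: Array[String]): Unit = {
    val outcome = scala.util.Try(
      mergeFile(Map("id" -> "long"), "part-0.parquet", Map("id" -> "string")))
    // The wrapped message names the file; the original failure is kept as the cause.
    outcome.failed.foreach(e => println(s"${e.getMessage} (cause: ${e.getCause.getMessage})"))
  }
}
```

Keeping the original exception as the cause preserves the low-level detail, while the wrapper message carries the file path and schema that the PR title asks for.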

rxin · 2016-01-29T07:18:44Z

@viirya do you want to submit a pull request to address your issue? @liancheng is busy with something right now.

i.e. create a pull request that contains this one and set the 1st

SparkQA · 2016-01-29T07:22:56Z

Test build #50354 has finished for PR 10972 at commit 2e5eddb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-01-29T07:26:06Z

@viirya It would be great if you could help, since you are pretty familiar with this part of the code :)

liancheng · 2016-01-29T07:37:05Z

I'm closing this one.

viirya · 2016-01-29T07:54:41Z

@rxin @liancheng yes, I would love to do it. Thanks.

Better error message when Parquet schema merging fails

1ce61a4

rxin reviewed Jan 29, 2016

Addresses PR comments

bfe4987

liancheng force-pushed the schema-merging-failure-message branch from 0ff9651 to bfe4987 Compare January 29, 2016 02:31

viirya reviewed Jan 29, 2016

rxin reviewed Jan 29, 2016

Addresses PR comments

2e5eddb

liancheng closed this Jan 29, 2016

4 participants